The purpose of this notebook is to give data locations, data ingestion code, and code for rudimentary analysis and visualization of COVID-19 data provided by The New York Times [NYT1].
The following steps are taken:
1. Ingest data:
   - COVID-19 data from The New York Times, based on reports from state and local health agencies [NYT1].
   - USA county records data (FIPS codes, geo-coordinates, populations) [WRI1].
2. Merge the data.
3. Make data summaries and related plots.
4. Make corresponding geo-plots.
5. Do “out of the box” time series forecasts.
6. Analyze fluctuations around time series trends.
Note that other, older repositories with COVID-19 data exist, e.g. [JH1, VK1].
Remark: The time series section is for illustration purposes only; the forecasts there should not be taken seriously.
From the help page of tolower:
capwords <- function(s, strict = FALSE) {
    cap <- function(s) paste(toupper(substring(s, 1, 1)),
                             {s <- substring(s, 2); if (strict) tolower(s) else s},
                             sep = "", collapse = " ")
    sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}
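A quick check of the helper's behavior (the definition is repeated so the snippet runs on its own; the input strings are made-up examples):

```r
# capwords capitalizes the first letter of each space-separated word;
# with strict = TRUE the rest of each word is lower-cased.
capwords <- function(s, strict = FALSE) {
    cap <- function(s) paste(toupper(substring(s, 1, 1)),
                             {s <- substring(s, 2); if (strict) tolower(s) else s},
                             sep = "", collapse = " ")
    sapply(strsplit(s, split = " "), cap, USE.NAMES = !is.null(names(s)))
}

capwords("new york")                 # "New York"
capwords("NEW york", strict = TRUE)  # "New York"
```

Below it is used to turn the lower-case CSV column names ("date", "cases", ...) into capitalized ones ("Date", "Cases", ...).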
if ( !exists("dfNYDataStates") ) {
  dfNYDataStates <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv",
                              colClasses = c("character", "character", "character", "integer", "integer"),
                              stringsAsFactors = FALSE )
  colnames(dfNYDataStates) <- capwords(colnames(dfNYDataStates))
}
head(dfNYDataStates)
dfNYDataStates$DateObject <- as.POSIXct(dfNYDataStates$Date)
summary(as.data.frame(unclass(dfNYDataStates), stringsAsFactors = TRUE))
Date State Fips Cases Deaths DateObject
2020-03-28: 55 Washington : 578 53 : 578 Min. : 1 Min. : 0 Min. :2020-01-21 00:00:00
2020-03-29: 55 Illinois : 575 17 : 575 1st Qu.: 14840 1st Qu.: 326 1st Qu.:2020-07-14 00:00:00
2020-03-30: 55 California : 574 06 : 574 Median : 104087 Median : 1967 Median :2020-11-25 00:00:00
2020-03-31: 55 Arizona : 573 04 : 573 Mean : 301818 Mean : 5868 Mean :2020-11-25 03:46:21
2020-04-01: 55 Massachusetts: 567 25 : 567 3rd Qu.: 375902 3rd Qu.: 7044 3rd Qu.:2021-04-08 00:00:00
2020-04-02: 55 Wisconsin : 563 55 : 563 Max. :4303611 Max. :65020 Max. :2021-08-20 00:00:00
(Other) :29164 (Other) :26064 (Other):26064
Summary by state:
by( data = as.data.frame(unclass(dfNYDataStates)), INDICES = dfNYDataStates$State, FUN = summary )
Alternative summary:
Hmisc::describe(dfNYDataStates)
if ( !exists("dfNYDataCounties") ) {
  dfNYDataCounties <- read.csv( "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv",
                                colClasses = c("character", "character", "character", "character", "integer", "integer"),
                                stringsAsFactors = FALSE )
  colnames(dfNYDataCounties) <- capwords(colnames(dfNYDataCounties))
}
head(dfNYDataCounties)
dfNYDataCounties$DateObject <- as.POSIXct(dfNYDataCounties$Date)
summary(as.data.frame(unclass(dfNYDataCounties), stringsAsFactors = TRUE))
Date County State Fips Cases Deaths DateObject
2021-04-05: 3250 Washington: 15969 Texas : 125930 : 14884 Min. : 0 Min. : 0.0 Min. :2020-01-21 00:00:00
2021-08-03: 3250 Unknown : 13405 Georgia : 82329 53061 : 578 1st Qu.: 150 1st Qu.: 2.0 1st Qu.:2020-08-07 00:00:00
2021-08-04: 3250 Jefferson : 13375 Virginia: 67742 17031 : 575 Median : 888 Median : 17.0 Median :2020-12-11 00:00:00
2021-08-10: 3250 Franklin : 12807 Kentucky: 60652 06059 : 574 Mean : 5435 Mean : 108.1 Mean :2020-12-10 08:46:17
2021-08-20: 3250 Jackson : 12183 Missouri: 58478 04013 : 573 3rd Qu.: 3118 3rd Qu.: 61.0 3rd Qu.:2021-04-16 00:00:00
2021-04-07: 3249 Lincoln : 12145 Illinois: 52104 06037 : 573 Max. :1377253 Max. :33694.0 Max. :2021-08-20 00:00:00
(Other) :1618432 (Other) :1558047 (Other) :1190696 (Other):1620174 NA's :36857
if ( !exists("dfUSACountyData") ) {
  dfUSACountyData <- read.csv( "https://raw.githubusercontent.com/antononcube/SystemModeling/master/Data/dfUSACountyRecords.csv",
                               colClasses = c("character", "character", "character", "character", "integer", "numeric", "numeric"),
                               stringsAsFactors = FALSE )
}
head(dfUSACountyData)
summary(as.data.frame(unclass(dfUSACountyData), stringsAsFactors = TRUE))
Country State County FIPS Population Lat Lon
UnitedStates:3143 Texas : 254 WashingtonCounty: 30 01001 : 1 Min. : 89 Min. :19.60 Min. :-166.90
Georgia : 159 JeffersonCounty : 25 01003 : 1 1st Qu.: 10980 1st Qu.:34.70 1st Qu.: -98.23
Virginia: 134 FranklinCounty : 24 01005 : 1 Median : 25690 Median :38.37 Median : -90.40
Kentucky: 120 JacksonCounty : 23 01007 : 1 Mean : 102248 Mean :38.46 Mean : -92.28
Missouri: 115 LincolnCounty : 23 01009 : 1 3rd Qu.: 67507 3rd Qu.:41.81 3rd Qu.: -83.43
Kansas : 105 MadisonCounty : 19 01011 : 1 Max. :10170292 Max. :69.30 Max. : -67.63
(Other) :2256 (Other) :2999 (Other):3137
dsNYDataCountiesExtended <-
  dfNYDataCounties %>%
  dplyr::inner_join( dfUSACountyData %>% dplyr::select_at( .vars = c("FIPS", "Lat", "Lon", "Population") ), by = c( "Fips" = "FIPS" ) )
dsNYDataCountiesExtended
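The inner join keeps only the county rows whose FIPS code appears in both tables. A minimal sketch of the same semantics with hypothetical rows, using base-R merge() (which does an inner join by default) in place of dplyr::inner_join:

```r
# Toy tables: one NYT-style row set keyed by "Fips", one county-records
# set keyed by "FIPS". Values are made up for illustration.
dfCases  <- data.frame(Fips = c("53061", "17031"), Cases = c(10, 20))
dfCounty <- data.frame(FIPS = c("53061", "99999"), Population = c(771000, 1000))

# Inner join: only the shared key "53061" survives.
merge(dfCases, dfCounty, by.x = "Fips", by.y = "FIPS")
#    Fips Cases Population
# 1 53061    10     771000
```

Rows with FIPS codes missing from the county records table (e.g. the "Unknown" county rows in the NYT data) are dropped by this join.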
ParetoPlotForColumns( as.data.frame(lapply(dsNYDataCountiesExtended[,c("Cases", "Deaths")], as.numeric)), c("Cases", "Deaths"), scales = "free" )
Note that in the plots in this sub-section we filter out Hawaii and Alaska.
ggplot2::ggplot(dsNYDataCountiesExtended[ dsNYDataCountiesExtended$Lon > -130, c("Lat", "Lon", "Cases")]) +
  ggplot2::geom_point( ggplot2::aes(x = Lon, y = Lat, fill = log10(Cases)), alpha = 0.01, size = 0.5, color = "blue" ) +
  ggplot2::coord_quickmap()
The most recent versions of leaflet in RStudio have problems with the visualization below.
cf <- colorBin( palette = "Reds", domain = log10(dsNYDataCountiesExtended$Cases), bins = 10 )
m <-
  leaflet( dsNYDataCountiesExtended[, c("Lat", "Lon", "Cases")] ) %>%
  addTiles() %>%
  addCircleMarkers( ~Lon, ~Lat, radius = ~ log10(Cases), fillColor = ~ cf(log10(Cases)), color = ~ cf(log10(Cases)), fillOpacity = 0.8, stroke = FALSE, popup = ~Cases )
m
dsNYDataCountiesExtended
An alternative to the geo-visualization is a heat-map plot.
Make a heat-map plot by sorting the rows of the cross-tabulation matrix (that correspond to states):
matSDC <- xtabs( Cases ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDC+1), cellnote = as.matrix(matSDC), scale = "none", dendrogram = "row", colors = "Blues") #, theme = "dark")
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette RdYlBu is 11
Returning the palette you asked for with that many colors
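The cross-tabulation step above can be illustrated with a tiny (dense, non-sparse) toy example; states become matrix rows and dates become columns, with missing combinations filled with 0:

```r
# Toy long-form data (hypothetical values, same shape as dfNYDataStates):
df <- data.frame(State = c("WA", "WA", "NY"),
                 Date  = c("2020-03-01", "2020-03-02", "2020-03-01"),
                 Cases = c(1, 2, 5))

# Cross-tabulate: sum of Cases for each State x Date cell.
m <- xtabs(Cases ~ State + Date, df)
m["WA", "2020-03-02"]  # 2
m["NY", "2020-03-02"]  # 0 -- absent combinations become zeros
```

The notebook's call adds sparse = TRUE, which returns a sparse Matrix object instead of a dense contingency table; log10(matSDC + 1) then compresses the large dynamic range of the counts before plotting.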
Cross-tabulate states with dates over deaths and plot:
matSDD <- xtabs( Deaths ~ State + Date, dfNYDataStates, sparse = TRUE)
d3heatmap::d3heatmap( log10(matSDD+1), cellnote = as.matrix(matSDD), scale = "none", dendrogram = "row", colors = "Blues") #, theme = "dark")
Warning in RColorBrewer::brewer.pal(n, pal) :
n too large, allowed maximum for palette RdYlBu is 11
Returning the palette you asked for with that many colors
In this section we do simple “forecasting” (not a serious attempt).
Make time series data frame in long form:
dfQuery <-
  dfNYDataStates %>%
  dplyr::group_by( Date, DateObject ) %>%
  dplyr::summarise_at( .vars = c("Cases", "Deaths"), .funs = sum )
dfQueryLongForm <- tidyr::pivot_longer( dfQuery, cols = c("Cases", "Deaths"), names_to = "Variable", values_to = "Value")
head(dfQueryLongForm)
Plot the time series:
ggplot(dfQueryLongForm) +
  geom_line( aes( x = DateObject, y = log10(Value) ) ) +
  facet_wrap( ~Variable, ncol = 1 )
Fit the cases series using ARIMA:
fit <- forecast::auto.arima( dfQuery$Cases )
fit
Series: dfQuery$Cases
ARIMA(5,2,2)
Coefficients:
ar1 ar2 ar3 ar4 ar5 ma1 ma2
-0.0709 -0.7032 -0.4278 -0.3593 -0.5618 -0.6517 0.6317
s.e. 0.0615 0.0445 0.0462 0.0332 0.0428 0.0773 0.0363
sigma^2 estimated as 302286721: log likelihood=-6439.15
AIC=12894.3 AICc=12894.55 BIC=12929.15
Plot “forecast”:
plot( forecast::forecast(fit, h = 20) )
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")
Fit the deaths series using ARIMA:
fit <- forecast::auto.arima( dfQuery$Deaths )
fit
Series: dfQuery$Deaths
ARIMA(3,2,2)
Coefficients:
ar1 ar2 ar3 ma1 ma2
0.9997 -0.6314 -0.1766 -1.3792 0.6554
s.e. 0.0562 0.0606 0.0527 0.0440 0.0351
sigma^2 estimated as 140488: log likelihood=-4229.64
AIC=8471.28 AICc=8471.43 BIC=8497.42
Plot “forecast”:
plot( forecast::forecast(fit, h = 20) )
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")
We want to see whether the time series have fluctuations around their trends and to estimate the distributions of those fluctuations. (Knowing those distributions, further studies can be done.)
This can be done efficiently with the software monad QRMon, [AAp2, AA1]. Here we load the QRMon package:
#devtools::install_github(repo = "antononcube/QRMon-R")
library(QRMon)
Here we plot the consecutive differences of the cases and deaths:
dfQueryLongForm <-
  dfQueryLongForm %>%
  dplyr::group_by( Variable ) %>%
  dplyr::arrange( DateObject ) %>%
  dplyr::mutate( Difference = c(0, diff(Value)) ) %>%
  dplyr::ungroup()
ggplot(dfQueryLongForm) +
  geom_line( aes( x = DateObject, y = Difference ) ) +
  facet_wrap( ~Variable, ncol = 1, scales = "free_y" )
From the plots we see that the time series are not monotonically increasing and that there are non-trivial fluctuations in the data.
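The Difference column above is c(0, diff(Value)) computed per variable: diff returns one element fewer than its input, so a leading 0 pads the differences to the length of the original series. A small example with toy values:

```r
# Toy cumulative counts (made up):
x <- c(3, 5, 9, 9, 14)

# Consecutive differences, padded so the result aligns with x:
c(0, diff(x))  # 0 2 4 0 5
```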
Here we take the interesting part of the cases data:
dfQueryLongForm2 <-
  dfQueryLongForm %>%
  dplyr::filter( difftime( DateObject, as.POSIXct("2020-05-01") ) >= 0 ) %>%
  dplyr::mutate( Regressor = as.numeric(DateObject) ) # seconds since 1970-01-01 UTC
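The Regressor column is just the POSIXct value coerced to a number. POSIXct is stored internally as seconds since the epoch 1970-01-01 UTC, which matches the ~1.6e9 Regressor values in the data summary below:

```r
# POSIXct -> numeric gives seconds since 1970-01-01 UTC:
t0 <- as.POSIXct("2020-05-01", tz = "UTC")
as.numeric(t0)  # 1588291200
```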
Here we specify a QRMon workflow that rescales the data, fits a B-spline curve to get the trend, and finds the absolute and relative errors (residuals, fluctuations) around that trend:
qrObj <-
  QRMonUnit( dfQueryLongForm2 %>% dplyr::filter( Variable == "Cases" ) %>% dplyr::select( Regressor, Value ) ) %>%
  QRMonRescale(regressorAxisQ = F, valueAxisQ = T) %>%
  QRMonEchoDataSummary %>%
  QRMonQuantileRegression( df = 16, probabilities = 0.5 )
$Dimensions
[1] 477 2
$Summary
Regressor Value
Min. :1.588e+09 Min. :0.0000
1st Qu.:1.599e+09 1st Qu.:0.1319
Median :1.609e+09 Median :0.4842
Mean :1.609e+09 Mean :0.4791
3rd Qu.:1.619e+09 3rd Qu.:0.8454
Max. :1.629e+09 Max. :1.0000
Here we plot the fit:
qrObj <- qrObj %>% QRMonPlot(datePlotQ = T)
Here we plot absolute errors:
qrObj <- qrObj %>% QRMonErrorsPlot(relativeErrorsQ = F, datePlotQ = T)
Here is the summary:
summary( (qrObj %>% QRMonErrors(relativeErrorsQ = F) %>% QRMonTakeValue)[[1]]$Error )
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.0035826 -0.0006384 0.0000000 0.0002062 0.0006325 0.0102965
Here we plot relative errors:
qrObj <- qrObj %>% QRMonErrorsPlot(relativeErrorsQ = T, datePlotQ = T)
Here is the summary:
summary( (qrObj %>% QRMonErrors(relativeErrorsQ = T) %>% QRMonTakeValue)[[1]]$Error )
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.6406639 -0.0018016 0.0000000 -0.0009474 0.0019173 0.0536856
[NYT1] The New York Times, Coronavirus (Covid-19) Data in the United States, (2020), GitHub.
[WRI1] Wolfram Research Inc., USA county records, (2020), System Modeling at GitHub.
[JH1] CSSE at Johns Hopkins University, COVID-19, (2020), GitHub.
[VK1] Vitaliy Kaurov, Resources For Novel Coronavirus COVID-19, (2020), community.wolfram.com.
[AA1] Anton Antonov, “A monad for Quantile Regression workflows”, (2018), MathematicaForPrediction at WordPress.
[AAp1] Anton Antonov, Heatmap plot Mathematica package, (2018), MathematicaForPrediction at GitHub.
[AAp2] Anton Antonov, Monadic Quantile Regression Mathematica package, (2018), MathematicaForPrediction at GitHub.